from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, HistGradientBoostingClassifier
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
from sklearn.metrics import make_scorer, accuracy_score
import pandas as pd
import plotly.express as px
import plotly.io as pio
import plotly.graph_objs as go
import math
Overview
Theme
The theme of this Kaggle competition is forecasting grocery sales in Ecuador. What makes it interesting is that, in addition to basic store information and turnover data for local stores, it includes data on Ecuador’s oil prices, holidays, and events. These datasets are not directly part of the grocery sales records, but they do impact grocery sales, which makes the project both intriguing and challenging. In this project, we predict how these external factors affect the transactions and sales of grocery stores.
Why
We decided to take on this Kaggle competition because its results would be helpful for business and store owners. It is useful to know when to open a store and what sales and transactions to expect given the different external factors that affect them. Store owners can also use sales and transaction forecasts to decide when to restock or buy items, and when to close the store during periods when nothing is selling.
Question
How is the trend of transactions in Ecuador changing over time, and how do external factors such as oil prices and holidays affect total store sales? We aim to explore how these external factors have influenced local store transactions and attempt to predict their impact on transactions. If possible, we will compare forecast results with and without these external factors to see the difference, and predict sales beyond 2017 (2017-2018).
Dataset
The dataset consists of four parts. stores.csv contains basic information about local stores, including store number, city and state location, and the type and cluster of the store. holidays_events.csv covers all holidays and events from 2012 to 2017, categorized as local, regional, or national. oil.csv records Ecuador’s daily oil prices, a significant factor affecting the local economy. Lastly, transactions.csv details the daily transaction counts for each store, identified by store number. The competition also provides its own train and test datasets. We will use those to forecast past sales, since the sales column is derived from the organizers’ own calculation, and we will also build our own train and test split to predict future transactions.
The project covers the sales for each product family and store combination, transactions over the years, interpolation of oil prices, data manipulation of the holidays and events data, and the effects of outside variables such as earthquakes. It also covers the relationships between sales, transactions, and oil prices, the effects of holidays and seasonality, and forecasting with machine learning.
Critique of prior work
Ekrem Bayar
What did they do well that is inspiring? They explained every step they worked on with clear text and plots to show the evidence.
What could they improve on or explore further? They did an excellent job on this project; the only defect is that the dataset is so large that when interpreting a plot, we can only see the overall trend, not the details.
Do you trust their results? Why or why not? Yes, I do. Like I mentioned before, they came up with reasonable explanations and plots to validate their results.
Howoo Jang
What did they do well that is inspiring? They gave conclusions for each step which helps viewers understand.
What could they improve on or explore further? They did not include any introduction to the project at the beginning and started with a large amount of code, which can confuse viewers.
Do you trust their results? Why or why not? I do trust their results because they gave the reason for every chunk of codes and conclusions of the results.
Extension
In this project, we extend prior work by comparing forecasts and models with and without external factors and by predicting future transactions. This includes shifting the transaction values down by one day.
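The one-day shift mentioned above can be sketched with pandas’ `shift`; a minimal example on a hypothetical toy frame (column names follow the competition’s transactions.csv):

```python
import pandas as pd

# Hypothetical toy frame standing in for the competition's transactions data.
df = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"]),
    "transactions": [100, 120, 90],
})

# Shift transactions down by one day so each row sees the previous day's value.
df["transactions_lag1"] = df["transactions"].shift(1)
print(df)
```

The first row has no previous day, so its lagged value is NaN and is dropped before training.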
Approach
Problem Description
We will predict transactions for products at stores located in Ecuador after 2017. Using the transactions, oil price, and holiday variables, we will try to predict future transactions, analyzing the whole dataset to do so. However, due to the extensive scale of the dataset, to ensure the feasibility of the analysis and the accuracy of the model predictions, we narrowed down the data for the first attempt, selecting only the period from August 15, 2016, to August 15, 2017. The second and third attempts use the whole dataset, which made the task more complicated and the predictions less accurate.
Provenance
Link: https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data?select=train.csv The dataset used in this project originates from Kaggle, specifically from a competition focused on forecasting sales in Ecuadorian stores. The data spans from 2012 to 2017 and is recorded daily.
Note that the daily oil price was not recorded on 2013-01-01.
Data Analysis
Merging the datasets is straightforward because the key columns, date and store number, are shared across the files. The date column records each day, which makes the data detailed and useful for interpretation, but also complex. Some dates are missing or not recorded for some store numbers, which could mean no transactions or sales were made at those stores on those days.
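The merge described above can be sketched as follows, using hypothetical miniature versions of transactions.csv, stores.csv, and oil.csv:

```python
import pandas as pd

# Hypothetical miniature versions of the competition's CSV files.
transactions = pd.DataFrame({
    "date": ["2017-01-02", "2017-01-02"],
    "store_nbr": [1, 2],
    "transactions": [1500, 900],
})
stores = pd.DataFrame({
    "store_nbr": [1, 2],
    "city": ["Quito", "Guayaquil"],
})
oil = pd.DataFrame({"date": ["2017-01-02"], "dcoilwtico": [52.3]})

# Merge store metadata on store_nbr, then daily oil prices on date.
merged = (transactions
          .merge(stores, on="store_nbr", how="left")
          .merge(oil, on="date", how="left"))
print(merged)
```

Left joins keep every transaction row even when a store or oil-price record is missing for that key.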
show_plot(px.box(sales, x="city", y="transactions"))
From this chart, we can observe that the transaction count in Quito is significantly higher than in other cities. This could be because Quito is the capital, and its residents may have greater purchasing power.
Code
header_values = ['KPI', 'Value']
cell_values = [
    ['Number of Stores', 'Number of Different Products', 'Window Start Date',
     'Window End Date', '#Rows in training set', '#Date Points'],
    [train['store_nbr'].nunique(), train['family'].nunique(), train['date'].min(),
     train['date'].max(), train.shape[0], train['date'].nunique()]
]
show_plot(go.Figure(data=[go.Table(header=dict(values=header_values),
                                   cells=dict(values=cell_values))]))
Code
train_aux = train[['date', 'sales', 'onpromotion']].groupby('date').mean()
train_aux = train_aux.reset_index()
show_plot(go.Figure(data=go.Scatter(x=train_aux['date'], y=train_aux['sales'],
                                    marker_color='red', text="sales"))
          .update_layout({"title": 'Avg Sales by date for all stores and products',
                          "xaxis": {"title": "Date"},
                          "yaxis": {"title": "Avg Unit Sold"},
                          "showlegend": False}))
Sales show an increasing trend. Over the last two years (since July 2015) the trend has been stable (an almost stationary time series). Zooming in reveals a seasonality every 7 days with the same weekly pattern (higher sales on weekends). The peak of the weekly seasonality falls on Saturdays and Sundays. On the 1st of January every year the supermarkets are closed (sales = 0).
Code
show_plot(px.scatter(train_aux[train_aux['onpromotion'] > 0],
                     x="onpromotion", y="sales", color='sales',
                     color_continuous_scale="earth", size='sales',
                     log_x=True, size_max=30)
          .update_layout({"title": 'Correlation between OnPromotion and Sales (total avg sales and promotion each day)',
                          "xaxis": {"title": "On Promotion"},
                          "yaxis": {"title": "Sales"},
                          "showlegend": False}))
From this chart, we can see that promotional tactics have a significant impact on increasing sales volume.
From this chart, we can see that from 2013 to 2017, the overall sales volume trend has been steadily increasing.
Code
show_plot(px.bar(df_month, x='month', y='sales', color='sales',
                 color_continuous_scale="puor")
          .update_layout({"title": 'AVG SALES BY MONTH',
                          "xaxis": {"title": "MONTH"},
                          "yaxis": {"title": "Avg Unit Sold"},
                          "showlegend": False}))
From this chart, we can see that the sales volume in December is significantly higher than in other months. This could be due to major holidays contributing to increased purchases, such as Christmas.
Code
show_plot(px.bar(df_day_of_week, x='day_of_week', y='sales', color='sales',
                 color_continuous_scale="speed")
          .update_layout({"title": 'AVG SALES BY DAY OF WEEK',
                          "xaxis": {"title": "DAY OF WEEK"},
                          "yaxis": {"title": "Avg Unit Sold"},
                          "showlegend": False}))
We can see that the sales volume on weekends is significantly higher than during the weekdays.
From the chart, we can observe that oil prices were relatively stable from 2013 to 2014. However, there was a significant drop in oil prices from 2014 to 2015. This downward trend continued into 2016. Starting from 2016, oil prices began to stabilize again, but they were still considerably lower compared to the levels in 2013.
Code
sales_oil = train.groupby('date').mean(numeric_only=True)['sales']
sales_oil = sales_oil.reset_index()
sales_oil = pd.merge(sales_oil, oil, on='date', how='left')

# We don't have all the oil prices available, so impute missing values
# with the most recent observed price (seeded with 93.14 for leading gaps)
previous_price = 93.14
for index, row in sales_oil.iterrows():
    if math.isnan(row['dcoilwtico']):
        sales_oil.loc[sales_oil['date'] == row['date'], 'dcoilwtico'] = previous_price
    else:
        previous_price = row['dcoilwtico']
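The row-by-row imputation loop above can also be expressed with pandas’ built-in forward fill; a minimal sketch on a toy series (the 93.14 seed for any leading gap is taken from the loop above):

```python
import pandas as pd

# Toy series with missing oil prices, standing in for sales_oil["dcoilwtico"].
prices = pd.Series([float("nan"), 93.14, float("nan"), 92.97, float("nan")])

# Forward-fill carries the last observed price into each gap; a leading NaN
# (no prior observation) is seeded with 93.14, matching the loop above.
filled = prices.ffill().fillna(93.14)
print(filled.tolist())
```

This is vectorized, so it avoids both the `iterrows` loop and the repeated boolean-mask lookups.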
Code
show_plot(px.scatter(sales_oil, x="dcoilwtico", y="sales", size='sales',
                     color='sales', color_continuous_scale="pinkyl")
          .update_layout({"title": 'Correlation between Oil Prices and Total Average Sales',
                          "xaxis": {"title": "Oil Price"},
                          "yaxis": {"title": "Sales"},
                          "showlegend": False}))
We observed that higher oil prices correspond to a significant decrease in sales values.
Code
show_plot(px.line(sales, x="date", y="transactions", facet_col="type_x", facet_col_wrap=2))
Type A and type D stores seem to have higher transactions, but we are not sure what these types mean because they are not explained in the description on Kaggle.
Summary
Each EDA graph shows a clear trend, whether increasing or decreasing. The graphs show that sales and transactions increase every year and when a positive external factor applies, such as promotions, holidays, or seasonality. Sales and transactions decrease when the oil price increases, which makes sense since higher oil prices make products more expensive.
The variable we are trying to predict is transactions. We will attempt to train the model in three different ways.
First attempt
Code
transactions_query = sales.query("'2016-08-15' <= date <= '2017-08-15'")
transactions_query.head(5)
        date        store_nbr  transactions  city     state        type_x  cluster  type_y   locale  locale_name  description            transferred  dcoilwtico
65412   2016-08-15  25.0       840.0         Salinas  Santa Elena  D       1.0      Holiday  Local   Riobamba     Fundacion de Riobamba  False        45.72
65413   2016-08-15  1.0        1776.0        Quito    Pichincha    D       13.0     Holiday  Local   Riobamba     Fundacion de Riobamba  False        45.72
65414   2016-08-15  2.0        1789.0        Quito    Pichincha    D       13.0     Holiday  Local   Riobamba     Fundacion de Riobamba  False        45.72
65415   2016-08-15  3.0        2938.0        Quito    Pichincha    D       8.0      Holiday  Local   Riobamba     Fundacion de Riobamba  False        45.72
65416   2016-08-15  4.0        1348.0        Quito    Pichincha    D       9.0      Holiday  Local   Riobamba     Fundacion de Riobamba  False        45.72
The first approach used data from August 15, 2016, to August 15, 2017, directly. We merged the original files ourselves, including stores.csv, oil.csv, holidays_events.csv, and transactions.csv; the resulting dataframe was not based on the train and test data provided by the competition. In this attempt, we did not yet incorporate time lags.
Setup
Code
# Select specific columns and remove missing values
new_transactions = transactions_query[['date', 'store_nbr', 'transactions']]
new_transactions = new_transactions.dropna()

# Define target variable and features
target = "transactions"
features = [i for i in new_transactions.columns if i != target]

train_original, test_original = train_test_split(new_transactions, test_size=0.2, random_state=42)
print(f"train_original: {len(train_original)} transactions, test_original: {len(test_original)} transactions")

# Convert dates to numeric timestamps so the models can use them as a feature
train_original['date'] = pd.to_datetime(train_original['date']).apply(lambda x: x.timestamp())
test_original['date'] = pd.to_datetime(test_original['date']).apply(lambda x: x.timestamp())
We use date and store_nbr as features because we want the total transactions of each store per date. We use MAE and MAPE to measure accuracy and expect the three models to produce reasonable values. Because the target is continuous, we use linear regression, a random forest regressor, and a decision tree regressor.
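The `evaluate` helper called throughout this report is never shown; a plausible sketch consistent with the printed output (a pandas Series holding MAE and MAPE), assuming it simply wraps sklearn’s metrics:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

def evaluate(y_true, y_pred):
    """Return MAE and MAPE as a pandas Series, matching the report's printouts."""
    return pd.Series({
        "MAE": mean_absolute_error(y_true, y_pred),
        "MAPE": mean_absolute_percentage_error(y_true, y_pred),
    })

print(evaluate([100, 200, 300], [110, 190, 330]))
```

Printing the returned Series yields the `MAE … MAPE … dtype: float64` format seen in the evaluation outputs below.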
We merely adjusted the maximum depth of the random forest to achieve better MAE and MAPE values. Currently, the random forest regressor performs best, with the lowest test MAE of 177.70. However, the gap between a training-set MAPE of only 0.05 and a testing-set MAPE of 0.11 indicates a potential risk of overfitting. Next, we proceed with cross-validation of these models.
Cross Validation
Code
# Cross-validation of the random forest model
# Reprocess the data so that the date column is converted to a timestamp
new_transactions['date'] = pd.to_datetime(new_transactions['date'])
new_transactions['date'] = new_transactions['date'].apply(lambda x: x.timestamp())

# Update the feature list, ensuring that the target column is not included
features = [i for i in new_transactions.columns if i != target]

# Perform cross-validation
rf_scores = cross_val_score(RandomForestRegressor(max_depth=25),
                            X=new_transactions[features], y=new_transactions[target],
                            cv=5, scoring='neg_mean_absolute_error')
rf_mae = -rf_scores.mean()
print(f"Random Forest MAE: {rf_mae}")
Random Forest MAE: 277.64885674827
Code
# Cross-validation of the decision tree model
dr_scores = cross_val_score(DecisionTreeRegressor(max_depth=25),
                            X=new_transactions[features], y=new_transactions[target],
                            cv=5, scoring='neg_mean_absolute_error')
dr_mae = -dr_scores.mean()
print(f"Decision Tree MAE: {dr_mae}")
The outcomes from these two predictive models differed, with the random forest regression showing superior performance. Its training-set and test-set evaluations were:

MAE     67.768521
MAPE     0.047034
dtype: float64

MAE     177.656220
MAPE      0.111345
dtype: float64
After cross-validation, Random Forest MAE: 276.70372007575213
Second attempt
The second time, we started to attempt incorporating time lag into the original transaction data provided by the competition. We still chose three models: linear regression, decision tree, and random forest. The features consist of the newly added lagged data of transactions. The target variable remains the prediction of transactions.
Lagged Features
Code
# Lagged feature, first try:
# train the model with data lagged/shifted by one day

# Tidy data
transactions['date'] = pd.to_datetime(transactions['date'])
transactions.sort_values(by='date', inplace=True)

# Create a lagged feature of 1 day
transactions['transactions_lag1'] = transactions['transactions'].shift(1)

# Remove rows containing null values
transactions.dropna(inplace=True)

# Define features and target variable
X_lag_transaction_day = transactions[['transactions_lag1']]  # feature variable
y_lag_transaction_day = transactions['transactions']         # target variable

# Split the dataset
X_train_lag_transaction_day, X_test_lag_transaction_day, y_train_lag_transaction_day, y_test_lag_transaction_day = train_test_split(
    X_lag_transaction_day, y_lag_transaction_day, test_size=0.2, random_state=42)
Linear Regression Model
Code
# Create model
try_model = LinearRegression()

# Train model
try_model.fit(X_train_lag_transaction_day, y_train_lag_transaction_day)

# Conduct predictions
predictions = try_model.predict(X_test_lag_transaction_day)

# Evaluation of the linear regression model on the test set
try_lr_eval = evaluate(y_test_lag_transaction_day, predictions)
mse = mean_squared_error(y_test_lag_transaction_day, predictions)
print(f'Linear Regression Test Evaluation \n{try_lr_eval}\n')

# Evaluation of the linear regression model on the training set
train_predictions_lag = try_model.predict(X_train_lag_transaction_day)
train_lr_eval_lag = evaluate(y_train_lag_transaction_day, train_predictions_lag)
train_mse_lag = mean_squared_error(y_train_lag_transaction_day, train_predictions_lag)
print(f'Linear Regression Training Evaluation (Lagged Features):\n{train_lr_eval_lag}\n')
Linear Regression Test Evaluation
MAE 607.888349
MAPE 0.458032
dtype: float64
Linear Regression Training Evaluation (Lagged Features):
MAE 617.495191
MAPE 0.456793
dtype: float64
Decision Tree Model
Code
# Create decision tree model
dt_model_lag = DecisionTreeRegressor(max_depth=5, min_samples_leaf=3)

# Train model
dt_model_lag.fit(X_train_lag_transaction_day, y_train_lag_transaction_day)

# Conduct predictions
dt_predictions_test = dt_model_lag.predict(X_test_lag_transaction_day)
dt_predictions_train = dt_model_lag.predict(X_train_lag_transaction_day)

# Evaluation of the decision tree model on the test set
dt_eval_test = evaluate(y_test_lag_transaction_day, dt_predictions_test)
print(f'Decision Tree Test Evaluation \n{dt_eval_test}\n')

# Evaluation of the decision tree model on the training set
dt_eval_train = evaluate(y_train_lag_transaction_day, dt_predictions_train)
print(f'Decision Tree Training Evaluation \n{dt_eval_train}\n')
Decision Tree Test Evaluation
MAE 584.448378
MAPE 0.449514
dtype: float64
Decision Tree Training Evaluation
MAE 592.12036
MAPE 0.44528
dtype: float64
Random Forest Model
Code
# Create a random forest model
rf_model_lag = RandomForestRegressor(n_estimators=100, max_depth=30)

# Train model
rf_model_lag.fit(X_train_lag_transaction_day, y_train_lag_transaction_day)

# Conduct predictions
rf_predictions_test = rf_model_lag.predict(X_test_lag_transaction_day)
rf_predictions_train = rf_model_lag.predict(X_train_lag_transaction_day)

# Evaluation of the random forest model on the test set
rf_eval_test = evaluate(y_test_lag_transaction_day, rf_predictions_test)
print(f'Random Forest Test Evaluation \n{rf_eval_test}\n')

# Evaluation of the random forest model on the training set
rf_eval_train = evaluate(y_train_lag_transaction_day, rf_predictions_train)
print(f'Random Forest Training Evaluation \n{rf_eval_train}\n')
Random Forest Test Evaluation
MAE 606.791550
MAPE 0.463317
dtype: float64
Random Forest Training Evaluation
MAE 569.508599
MAPE 0.428687
dtype: float64
We can see that the mean absolute errors (MAE) of the three models all fall roughly between 570 and 620, while the mean absolute percentage errors (MAPE) are between 0.43 and 0.46. The results are not ideal, so we incorporate more complex and diverse data for the third attempt.
show_plot(px.scatter(test_results, x='Actual_Transactions',
                     y=['LR_Predictions', 'DT_Predictions', 'RF_Predictions'],
                     labels={'value': 'Predicted Transactions'},
                     title='Comparison of Actual vs. Predicted Transactions'))
Code
# Melt the prediction results of each model into one long DataFrame
test_model = test_results.melt(id_vars=['Actual_Transactions'],
                               value_vars=['LR_Predictions', 'DT_Predictions', 'RF_Predictions'],
                               var_name='Model', value_name='Predictions')

# Map the model names to a more readable format
model_names = {'LR_Predictions': 'Linear Regression',
               'DT_Predictions': 'Decision Tree',
               'RF_Predictions': 'Random Forest'}
test_model['Model'] = test_model['Model'].map(model_names)

# Create scatter plot
show_plot(px.scatter(test_model, x='Actual_Transactions', y='Predictions',
                     facet_col='Model', color='Model',
                     labels={'Predictions': 'Predicted Transactions'},
                     title='Comparison of Actual vs. Predicted Transactions')
          .update_traces(marker=dict(size=3)))
Third attempt
We have incorporated data on oil prices, merging transactions from various stores into the total national transaction volume within a day. Lag features for each month of the year and each day have been added. We also cleaned up some rows with NaN values. In contrast to the first attempt, we utilized data from 2013 to 2017, anticipating that such data would provide better training material.
Prepare Data
Code
# Adding lag data

# Remove all categorical data from sales, keeping only quantitative data
quantitative_sales = sales.filter(['date', 'transactions', 'dcoilwtico'])

# Ensure the date column is in datetime format
quantitative_sales['date'] = pd.to_datetime(quantitative_sales['date'])

# Sum transactions and average dcoilwtico to get nationwide daily totals
quantitative_sales = quantitative_sales.groupby('date').agg({
    'transactions': 'sum',   # sum the transactions column
    'dcoilwtico': 'mean'     # average the dcoilwtico column
}).reset_index()

quantitative_sales = quantitative_sales[
    (quantitative_sales['date'] >= '2013-01-02') &
    (quantitative_sales['date'] <= '2017-08-15')]

# Use linear interpolation to fill in NaN values in the oil price (dcoilwtico)
quantitative_sales['dcoilwtico'] = quantitative_sales['dcoilwtico'].interpolate(method='linear')

# Create year, month, and day-of-year columns
quantitative_sales['year'] = quantitative_sales['date'].dt.year
quantitative_sales['month'] = quantitative_sales['date'].dt.month
quantitative_sales['day_of_year'] = quantitative_sales['date'].dt.dayofyear

# Create a lag feature for the same month each year and interpolate NaN values
quantitative_sales['lag_same_month'] = quantitative_sales.groupby(['year', 'month'])['transactions'].shift(1)
quantitative_sales['lag_same_month'] = quantitative_sales['lag_same_month'].interpolate(method='linear')

# Apply a one-day lag to the entire time series and forward-fill NaN values
quantitative_sales['lag_transactions'] = quantitative_sales['transactions'].shift(1)
quantitative_sales['lag_transactions'] = quantitative_sales['lag_transactions'].ffill()

# Remove the row for 2013-01-01 because it contains NaN
quantitative_sales = quantitative_sales.dropna()

# Drop the year, month, and day_of_year helper columns so they do not impact the model
quantitative_sales = quantitative_sales.drop(['year', 'month', 'day_of_year'], axis=1)

# Try removing the oil price (dcoilwtico) column to experiment with the model
# quantitative_sales = quantitative_sales.drop(['dcoilwtico'], axis=1)
During troubleshooting, we considered that oil prices might act as noise for the model. We therefore attempted to remove the oil price data midway through, but it brought no improvement, so the removal was commented out.
Modeling
Code
# Adding holiday data

# Create a new column and initialize it to False
holidays_events['is_holiday'] = False

# Update the 'is_holiday' column
for index, row in holidays_events.iterrows():
    if row['type'] == 'Holiday' and not row['transferred']:
        holidays_events.at[index, 'is_holiday'] = True
    elif row['type'] in ['Transfer', 'Additional']:
        holidays_events.at[index, 'is_holiday'] = True

# Keep the required columns (copy to avoid SettingWithCopyWarning)
holidays = holidays_events[['date', 'is_holiday']].copy()

# Merge the holiday data with quantitative_sales
quantitative_sales['date'] = pd.to_datetime(quantitative_sales['date'])
holidays['date'] = pd.to_datetime(holidays['date'])
quantitative_sales = pd.merge(quantitative_sales, holidays, on='date', how='left')

# Fill missing 'is_holiday' values with False after merging
quantitative_sales['is_holiday'] = quantitative_sales['is_holiday'].fillna(False)
When training the model, omitting the boolean holiday values makes the model less accurate on dates with extreme transaction volumes. Therefore, we organize the holiday data and extract it as a feature variable for training the model.
Code
# Attempt to narrow down the data to see if it improves model performance
# Select data between 2016-08-14 and 2017-08-14
# quantitative_sales = quantitative_sales[
#     (quantitative_sales['date'] >= pd.Timestamp('2016-08-14')) &
#     (quantitative_sales['date'] <= pd.Timestamp('2017-08-14'))]
When attempting the modeling later on, we experimented with a shorter time window to see if it would improve the model’s performance. In practice there was little improvement, so we commented it out.
Code
# New Year's Day and Christmas have extremely low transaction volumes,
# so boolean values are used to mark these days as extreme cases.
quantitative_sales['is_christmas_or_newyear'] = quantitative_sales['date'].apply(
    lambda x: (x.month == 12 and x.day == 25) or (x.month == 1 and x.day == 1))
Two extreme data points were tagged with boolean values. Since the transaction volumes on December 25th and January 1st are essentially zero every year, boolean values mark these two extreme cases in an attempt to improve model performance.
Linear Regression Model
Code
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)
lr_eval = evaluate(y_test, lr_predictions)
print(f'Linear Regression Test Evaluation:\n{lr_eval}\n')

lr_train_predictions = lr_model.predict(X_train)
lr_train_eval = evaluate(y_train, lr_train_predictions)
print(f'Linear Regression Training Evaluation:\n{lr_train_eval}\n')
Linear Regression Test Evaluation:
MAE 1.250002e+04
MAPE 3.388676e+17
dtype: float64
Linear Regression Training Evaluation:
MAE 1.215187e+04
MAPE 3.011712e+17
dtype: float64
Decision Tree Model
Code
dt_model = DecisionTreeRegressor(max_depth=5, min_samples_leaf=3)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
dt_eval = evaluate(y_test, dt_predictions)
print(f'Decision Tree Regressor Test Evaluation:\n{dt_eval}\n')

dt_train_predictions = dt_model.predict(X_train)
dt_train_eval = evaluate(y_train, dt_train_predictions)
print(f'Decision Tree Training Evaluation:\n{dt_train_eval}\n')
Decision Tree Regressor Test Evaluation:
MAE 11716.169815
MAPE 0.270153
dtype: float64
Decision Tree Training Evaluation:
MAE 9.717719e+03
MAPE 3.861322e+15
dtype: float64
Random Forest Model
Code
rf_model = RandomForestRegressor(n_estimators=1, max_depth=15)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_eval = evaluate(y_test, rf_predictions)
print(f'Random Forest Test Evaluation:\n{rf_eval}\n')

rf_train_predictions = rf_model.predict(X_train)
rf_train_eval = evaluate(y_train, rf_train_predictions)
print(f'Random Forest Training Evaluation:\n{rf_train_eval}\n')
Random Forest Test Evaluation:
MAE 14389.142851
MAPE 0.275480
dtype: float64
Random Forest Training Evaluation:
MAE 5311.977238
MAPE 0.064005
dtype: float64
It can be observed that, under this massive and complex dataset, the mean absolute error (MAE) and mean absolute percentage error (MAPE) of the linear regression model reach enormous values. The decision tree model has a test-set MAPE of only 0.27, but its training-set MAPE is an extremely large number, indicating insufficient performance. The random forest model, although marginally usable, has a test-set MAPE of roughly 0.25-0.31 against a training-set MAPE of 0.05-0.06, a typical overfitting pattern. Unfortunately, despite adjusting the maximum depth, the changes in the numbers show no clear pattern, and a maximum depth of 15 appears to yield the best results at present.
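The astronomical MAPE values (e.g. 3.4e+17) are likely an artifact of the metric itself rather than of the models: when an actual value is zero (the closed-store days), sklearn divides by a tiny epsilon floor instead of zero, producing a huge finite number. A minimal demonstration:

```python
from sklearn.metrics import mean_absolute_percentage_error

# One zero actual (a closed-store day) is enough to blow up MAPE,
# even when every other prediction is perfect.
y_true = [0, 1000, 2000]
y_pred = [50, 1000, 2000]
mape = mean_absolute_percentage_error(y_true, y_pred)
print(mape)
```

This suggests the extreme training-set MAPEs reflect the zero-sales dates rather than model quality; MAE, which has no division, stays interpretable on those rows.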
Cross-validation
Code
# Cross-validation of the linear regression model (MAE)
lr_mae_scores = cross_val_score(LinearRegression(), X, y, cv=5,
                                scoring='neg_mean_absolute_error')
lr_mae_avg = -lr_mae_scores.mean()
print(f'Linear Regression - Avg MAE: {lr_mae_avg}')
Linear Regression - Avg MAE: 12493.208182913555
Code
# Cross-validation of the decision tree model (MAE)
dt_mae_scores = cross_val_score(DecisionTreeRegressor(max_depth=5, min_samples_leaf=3),
                                X, y, cv=5, scoring='neg_mean_absolute_error')
dt_mae_avg = -dt_mae_scores.mean()
print(f'Decision Tree - Avg MAE: {dt_mae_avg}')
Decision Tree - Avg MAE: 11585.983434735343
Code
# Cross-validation of the random forest model (MAE)
rf_mae_scores = cross_val_score(RandomForestRegressor(n_estimators=1, max_depth=30),
                                X, y, cv=5, scoring='neg_mean_absolute_error')
rf_mae_avg = -rf_mae_scores.mean()
print(f'Random Forest - Avg MAE: {rf_mae_avg}')
Random Forest - Avg MAE: 13844.625072886298
We attempted cross-validation but were unable to improve the model’s MAE score. So far, we have tried various adjustments, but none have fundamentally changed the performance of the random forest model we trained. We suspect that using too much data to train the model increased its complexity, and this attempt should be counted among our limitations.
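One further caveat worth noting: `cross_val_score` with the default `KFold` allows validation folds to precede training folds in time, which leaks future information in a time-series setting. An alternative we did not use here is sklearn’s `TimeSeriesSplit`, which keeps each validation fold strictly after its training fold; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # rows assumed ordered by date
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # every test index comes strictly after every train index
    print(train_idx, test_idx)
```

`TimeSeriesSplit` can be passed as the `cv` argument to `cross_val_score` in place of the integer fold count.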
Findings
In this project, our goal was to use time series analysis to forecast sales trends and predict transaction trends of stores in Ecuador. The dataset we used was sourced from Kaggle, particularly from a competition aimed at predicting sales in Ecuadorian stores, with data ranging from 2012 to 2017. In our first attempt, given the large scale of the dataset, we limited our data range to August 15, 2016, through August 15, 2017, to ensure the feasibility of our analysis and the accuracy of our model predictions. Sales data was only present in the original training dataset provided; since we chose not to use the original training data, we utilized transaction data to generate a training set for prediction. In terms of model building, we opted for random forest regression and decision tree regression as our primary predictive tools. The results of these two predictive models were similar, with random forest regression performing slightly better.
In the second attempt, we started experimenting with incorporating time lags into the original transaction data provided by the competition, using the whole dataset from 2012 to 2017. We selected three models for our analysis: linear regression, decision trees, and random forests, with the newly added lagged transaction data as our feature. The objective remained the prediction of transactions. However, the results for all three models were not ideal, with the mean absolute error (MAE) roughly between 570 and 620 and the mean absolute percentage error (MAPE) between 0.43 and 0.46. To improve our results, we decided to add more complex and diverse data in our third attempt.
In this third attempt, we included oil price data and combined transactions from various stores to represent the total national transactions per day. We added lag features for the same month each year and daily lags, while also removing rows with NaN values. Differing from our first approach, we utilized all available data from 2013 to 2017, hoping that this would provide better training material. During our model training, when trying to eliminate errors, we considered the possibility that oil prices might be acting as noise in our model. We attempted to remove the oil price data, but this didn’t lead to any significant progress, so we decided to comment it out. We also noted that excluding holiday booleans made the model less accurate due to extreme transaction volumes on specific dates. Therefore, we organized holiday data and extracted it as a feature for model training.
Subsequently, we experimented with shorter time periods to see if they would enhance model performance, but there was no significant improvement, so we commented this approach out as well. We marked two extreme data points with boolean values, since transaction volumes are zero on December 25th and January 1st each year, in an attempt to improve model performance. With these complex datasets, the linear regression model's MAE and MAPE reached very high values. The decision tree had a test set MAPE of only 0.27, but an excessively large MAPE on the training set, indicating insufficient performance. The most usable model was the random forest, with a test set MAPE of about 0.25-0.31 and a training set MAPE of 0.05-0.06, a typical case of overfitting. Unfortunately, adjusting the maximum depth did not yield a regular pattern of improvement, with a depth of 15 providing the best results so far. Attempts at cross-validation also failed to change the model's MAE significantly.
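The depth sweep and cross-validation described above can be sketched as follows. The data here is synthetic (standing in for our lagged transaction frame), and the depths are a small illustrative subset of the values we tried:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic lag-style features and a target with a linear signal plus noise
X = rng.normal(size=(200, 3))
y = X[:, 0] * 50 + rng.normal(scale=5, size=200) + 500

# Cross-validated MAE for several maximum depths
results = {}
for depth in [5, 10, 15]:
    scores = cross_val_score(
        RandomForestRegressor(max_depth=depth, n_estimators=50, random_state=0),
        X, y, cv=5, scoring="neg_mean_absolute_error",
    )
    results[depth] = -scores.mean()  # flip sign: sklearn reports negated MAE
```

Comparing the entries of `results` is the pattern we followed; on our real data no depth in the sweep produced a consistent improvement.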
Up to this point, we’ve tried various adjustments but have been unable to fundamentally change the performance of the model trained with the random forest. We surmise that using an excessive amount of data for training increased complexity, and this should be considered a limitation of our approach.
Limitations and Social / Ethical Considerations
In our project, we focused primarily on forecasting time-series sales in the Ecuadorian retail market, and our analysis revealed several key limitations. First, since our dataset covered sales data from a specific region, it could carry a geographical bias, limiting the applicability of our analysis to other areas. Due to the complexity of the data, our method of imputing missing oil prices might not have fully accounted for the potential impact of oil price fluctuations on sales, which could affect the accuracy of our predictive models. Finally, we primarily used linear regression, decision tree, and random forest models, but these traditional statistical models might not be sufficient to capture the full complexity of the data, especially the relationships between sales and external factors like holidays or economic indicators.
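To make the oil-price imputation limitation concrete: linear interpolation (the approach we used in the third attempt) fills gaps smoothly between observed prices, so it cannot recover a genuine shock that happened inside a gap. A toy sketch with made-up prices:

```python
import numpy as np
import pandas as pd

# Toy oil-price series with gaps; the values are invented for illustration
oil = pd.DataFrame({
    "date": pd.date_range("2017-01-02", periods=5, freq="D"),
    "dcoilwtico": [53.0, np.nan, np.nan, 52.0, np.nan],
})

# Linear interpolation between known prices; the trailing gap is forward-filled
oil["dcoilwtico"] = oil["dcoilwtico"].interpolate(method="linear").ffill()
```

The two interior gaps become a straight line from 53.0 down to 52.0 — any real spike or dip on those days is invisible to the model.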
Future Directions
During the course of the project, we identified several new research directions. For instance, we found that exploring deeper relationships between sales data and factors like weather, holidays, and even socio-economic variables could be very valuable. This would not only require the collection of a broader range of data, such as weather records and macroeconomic indicators, but might also necessitate the use of more advanced analysis techniques, like deep learning models, which can more effectively capture and predict complex time series data. Additionally, comparing sales patterns between different regions or types of stores is an intriguing direction, which would require us to integrate a more diverse set of datasets and possibly involve new data integration and cleaning techniques. Through these new research directions, we hope to gain a more comprehensive understanding of consumer behavior and improve the accuracy and reliability of sales forecasts.
Source Code
---
title: "Store Sales - Time Series Forecasting"
author: "Ben Elpidius, Danyili Hong, Honglei Yang, John Ju"
format:
  html:
    embed-resources: true
    code-tools: true
    code-fold: true
---

# Load Packages

```{python}
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
import pandas as pd
import plotly.express as px
from sklearn.metrics import accuracy_score
import plotly.io as pio
from sklearn.ensemble import HistGradientBoostingClassifier
import plotly.graph_objs as go
import math
```

```{python}
#| echo: false
# DATA 202 hack for displaying plotly within RStudio:
if 'r' in globals() and r['.Platform$GUI'] == "RStudio" and r['suppressMessages(requireNamespace("htmltools"))']:
    r[".GlobalEnv$to_html <- function(x) { print(htmltools::HTML(x)) }"] and None

    def show_plot(p):
        r.to_html(p._repr_html_())
else:
    def show_plot(p):
        return p
# End hack
```

# Overview

### Theme

The theme of this Kaggle competition is to forecast grocery sales in Ecuador. What makes it interesting is that it not only provides basic data and turnover data for local stores but also includes data on Ecuador's oil prices, holidays, and events. These datasets are not directly related to grocery sales data but do impact grocery sales. This aspect makes the project both intriguing and challenging. In this project, we will predict how these external factors affect the transactions and sales of grocery stores.

### Why

We decided to take on this Kaggle competition because it will be helpful for business owners and store owners.
It would be good to know when to open a store and what to expect from it, such as sales and transactions, based on the different external factors that affect it. Additionally, store owners should also know the sales and transactions forecast to understand when to restock or buy items, or when to close their store because nothing is selling during that time.

### Question

How is the trend of transactions in Ecuador changing over time, influenced by external factors such as oil prices and holidays, and how does this impact total store sales? We aim to explore how these external factors have influenced local store transactions and attempt to predict their impact on transactions. If possible, we will compare the forecast results with and without these external factors to see the difference and predict the sales after 2017 (2017-2018).

### Dataset

The dataset consists of four parts. The stores.csv contains basic information about local stores, including store numbers, city and state locations, as well as the type and cluster of the store. The holidays_events.csv covers all holidays and events from 2012 to 2017, categorized as local, regional, and national. The oil.csv represents Ecuador's daily oil prices, a significant factor affecting the local economy. Lastly, the transactions.csv details the daily transaction numbers for each store identified by its number. Additionally, the original dataset also includes its own test and train datasets. We will use their train and test datasets to forecast previous sales, since sales is created through their own calculation, as well as making our own train and test datasets to predict future transactions.
# Prior Work

### Summary of Findings

#### EKREM BAYAR: https://www.kaggle.com/code/ekrembayar/store-sales-ts-forecasting-a-comprehensive-guide

The project includes the sales for each product family and store combination, the transactions over the years, interpolation for oil prices, data manipulation for the holiday and events data, and the effects of outside variables such as earthquakes.

#### HOWOO JANG: https://www.kaggle.com/code/howoojang/first-kaggle-notebook-following-ts-tutorial

The project includes the relationships between sales, transactions, and oil prices. It also covers the effects of holidays and seasonality, and forecasting with machine learning.

### Critique of prior work

##### Ekrem Bayar

1. What did they do well that is inspiring? They explained every step they worked on with reasonable text and plots to show the evidence.
2. What could they improve on or explore further? I think they did an excellent job on this project; the only defect I would note is that the dataset is so big that whenever we try to interpret a plot, we can only tell the trend but not the details.
3. Do you trust their results? Why or why not? Yes, I do. As mentioned before, they came up with reasonable explanations and plots to validate their results.

##### Howoo Jang

1. What did they do well that is inspiring? They gave conclusions for each step, which helps viewers understand.
2. What could they improve on or explore further? They didn't have any introduction about the project at the beginning and started with a bunch of code, which confuses viewers.
3. Do you trust their results? Why or why not? I do trust their results because they gave a reason for every chunk of code and conclusions about the results.

### Extension

In this project, we will extend prior work by comparing the forecast and model with and without external factors and predicting future transactions.
This will include shifting the transaction value down by one day.

# Approach

### Problem Description

We will predict the transactions for the products at stores located in Ecuador after 2017. Through the variables transactions, oil price, and holiday, we will try to predict future transactions. To do this, we will analyze the whole dataset. However, due to the extensive scale of the dataset, to ensure the feasibility of the analysis and the accuracy of the model predictions, we narrowed down the data, selecting only the period from August 15, 2016, to August 15, 2017, for the first attempt. The second and third attempts use the whole dataset, which made them more complicated and less accurate.

### Provenance

Link: https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data?select=train.csv

The dataset used in this project originates from the Kaggle website, specifically from a competition focused on forecasting sales in Ecuadorian stores. The data spans from 2012 to 2017, counting each day.

### Data

```{python}
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
transactions = pd.read_csv("data/transactions.csv")
stores = pd.read_csv("data/stores.csv")
oil = pd.read_csv("data/oil.csv")
holidays_events = pd.read_csv("data/holidays_events.csv")
```

### Features and Records

```{python}
transactions.info()
transactions.head(2)
```

On the day 2013-01-01, the store with store number 25 had a total of 770 transactions.

```{python}
stores.info()
stores.head(2)
```

Store number 1 is in the city of Quito, Pichincha State, with store type D and cluster 13 (a cluster is a grouping of similar stores).

```{python}
holidays_events.info()
holidays_events.head(2)
```

There was a local holiday on 2012-03-02 called Fundacion de Manta, which is celebrated in Manta.

```{python}
oil.info()
oil.head(2)
```

The daily oil price wasn't recorded on the day 2013-01-01.
### Data Analysis

It is easy to merge the datasets together because the key values are date and store number. These two key values are very useful and common. Additionally, the data records each day, which makes it very detailed and useful for interpretation. However, because of that it is also very complex. Some dates are missing or not recorded for some store numbers, which could be a result of no transactions or sales being made at that store.

### Data Wrangling

```{python}
sales = pd.merge(transactions, stores, on='store_nbr', how='outer')
sales = pd.merge(sales, holidays_events, on="date", how="outer")
sales = pd.merge(sales, oil, on="date", how="outer")
sales.head(5)
```

# Exploratory Data Analysis (EDA)

```{python}
show_plot(
    px.box(
        sales,
        x="city",
        y="transactions"
    )
)
```

From this chart, we can observe that the transaction count in Quito is significantly higher than in other cities. This could be because Quito is the capital, and its residents may have greater purchasing power.

```{python}
header_values = ['KPI', 'Value']
cell_values = [
    ['Number of Stores', 'Number of Different Products', 'Window Start Date', 'Window End Date',
     '#Rows in training set', '#Date Points'],
    [train['store_nbr'].nunique(), train['family'].nunique(), train['date'].min(),
     train['date'].max(), train.shape[0], train['date'].nunique()]
]
show_plot(go.Figure(data=[go.Table(header=dict(values=header_values), cells=dict(values=cell_values))]))
```

```{python}
train_aux = train[['date', 'sales', 'onpromotion']].groupby('date').mean()
train_aux = train_aux.reset_index()
show_plot(go.Figure(data=go.Scatter(x=train_aux['date'], y=train_aux['sales'], marker_color='red', text="sales"))
          .update_layout({"title": 'Avg Sales by date for all stores and products',
                          "xaxis": {"title": "Date"},
                          "yaxis": {"title": "Avg Unit Sold"},
                          "showlegend": False}))
```

There is an increasing trend in the sales.
In the last two years (since July 2015) the trend has been stable (an almost stationary time series). By zooming in, one can see a seasonality every 7 days with the same pattern (higher sales during the weekends). The peak of the weekly seasonality is on Sundays/Saturdays. On the 1st of January every year the supermarkets are not open (sales = 0).

```{python}
show_plot((px.scatter(train_aux[train_aux['onpromotion'] > 0],
                      x="onpromotion", y="sales", color='sales',
                      color_continuous_scale="earth", size='sales',
                      log_x=True, size_max=30)
           .update_layout({"title": 'Correlation between OnPromotion and Sales (total avg sales and promotion each day)',
                           "xaxis": {"title": "On Promotion"},
                           "yaxis": {"title": "Sales"},
                           "showlegend": False})))
```

From this chart, we can see that promotional tactics have a significant impact on increasing sales volume.

```{python}
train['year'] = pd.to_datetime(train['date']).dt.year
train['month'] = pd.to_datetime(train['date']).dt.strftime("%B")
train['day_of_week'] = pd.to_datetime(train['date']).dt.day_name()

df_year = train.groupby('year').mean(numeric_only=True)[['sales']]
df_year = df_year.reset_index()

df_month = train.groupby('month').mean(numeric_only=True)[['sales']]
new_order = ['January', 'February', 'March', 'April', 'May', 'June',
             'July', 'August', 'September', 'October', 'November', 'December']
df_month = df_month.reindex(new_order, axis=0)
df_month = df_month.reset_index()

df_day_of_week = train.groupby('day_of_week').mean(numeric_only=True)[['sales']]
new_order_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df_day_of_week = df_day_of_week.reindex(new_order_week, axis=0)
df_day_of_week = df_day_of_week.reset_index()

show_plot(
    px.bar(df_year, x='year', y='sales', color='sales', color_continuous_scale="darkmint")
    .update_layout({"title": 'AVG SALES BY YEAR',
                    "xaxis": {"title": "YEAR"},
                    "yaxis": {"title": "Avg Unit Sold"},
                    "showlegend": False}))
```

From this chart, we can see that from 2013 to 2017, the overall sales volume trend has been steadily increasing.

```{python}
show_plot(
    px.bar(df_month, x='month', y='sales', color='sales', color_continuous_scale="puor")
    .update_layout({"title": 'AVG SALES BY MONTH',
                    "xaxis": {"title": "MONTH"},
                    "yaxis": {"title": "Avg Unit Sold"},
                    "showlegend": False}))
```

From this chart, we can see that the sales volume in December is significantly higher than in other months. This could be due to major holidays, such as Christmas, contributing to increased purchases.

```{python}
show_plot(
    px.bar(df_day_of_week, x='day_of_week', y='sales', color='sales', color_continuous_scale="speed")
    .update_layout({"title": 'AVG SALES BY DAY OF WEEK',
                    "xaxis": {"title": "DAY OF WEEK"},
                    "yaxis": {"title": "Avg Unit Sold"},
                    "showlegend": False}))
```

We can see that the sales volume on weekends is significantly higher than during the weekdays.

```{python}
# Impute missing oil prices by carrying the previous day's price forward
previous_price = 93.14
for index, row in oil.iterrows():
    if math.isnan(row['dcoilwtico']):
        oil.loc[oil['date'] == row['date'], 'dcoilwtico'] = previous_price
    else:
        previous_price = row['dcoilwtico']
```

```{python}
train_aux = train[['date', 'sales']].groupby('date').mean()
train_aux = train_aux.reset_index()
show_plot(go.Figure(data=go.Scatter(x=oil['date'], y=oil['dcoilwtico'], marker_color='blue', text="sales"))
          .update_layout({"title": 'Oil Prices Chart',
                          "xaxis": {"title": "Date"},
                          "yaxis": {"title": "Oil Price"},
                          "showlegend": False}))
```

From the chart, we can observe that oil prices were relatively stable from 2013 to 2014. However, there was a significant drop in oil prices from 2014 to 2015, and this downward trend continued into 2016.
Starting in 2016, oil prices began to stabilize again, but they were still considerably lower compared to 2013 levels.

```{python}
sales_oil = train.groupby('date').mean()['sales']
sales_oil = sales_oil.reset_index()
sales_oil = pd.merge(sales_oil, oil, on='date', how='left')

# We don't have all the oil prices available, so we impute them
previous_price = 93.14
for index, row in sales_oil.iterrows():
    if math.isnan(row['dcoilwtico']):
        sales_oil.loc[sales_oil['date'] == row['date'], 'dcoilwtico'] = previous_price
    else:
        previous_price = row['dcoilwtico']
```

```{python}
show_plot(px.scatter(sales_oil, x="dcoilwtico", y="sales", size='sales', color='sales',
                     color_continuous_scale="pinkyl")
          .update_layout({"title": 'Correlation between Oil Prices and Total Average Sales',
                          "xaxis": {"title": "Oil Price"},
                          "yaxis": {"title": "Sales"},
                          "showlegend": False}))
```

We observed that higher oil prices correspond to a significant decrease in sales values.

```{python}
show_plot(
    px.line(
        sales,
        x="date",
        y="transactions",
        facet_col="type_x",
        facet_col_wrap=2
    )
)
```

Type a and type d seem to have higher transactions, but we are not sure what these types mean because it is not written in the description on Kaggle.

### Summary

Each EDA graph shows a clear trend, whether increasing or decreasing. The graphs show that sales and transactions increase every year, or when a positive external factor applies, such as promotions, holidays, and seasons. Sales and transactions decrease when the oil price increases, which makes sense since higher oil prices make products more expensive.

# Modeling

```{python}
def evaluate(y_true, y_pred):
    return pd.Series({
        'MAE': mean_absolute_error(y_true, y_pred),
        'MAPE': mean_absolute_percentage_error(y_true, y_pred),
        # 'MSE': mean_squared_error(y_true, y_pred),
    })
```

The variable we are trying to predict is transactions. We will attempt training the model in 3 different ways.
### First attempt

```{python}
transactions_query = sales.query("'2016-08-15' <= date <= '2017-08-15'")
transactions_query.head(5)
```

The first approach involved the direct use of data from August 15, 2016, to August 15, 2017. We merged the original data ourselves, including stores.csv, oil.csv, holidays_events.csv, and transactions.csv. The resulting dataframe was not based on the train and test data provided by this competition. However, this time, we did not attempt to incorporate time lags.

##### Setup

```{python}
# Selecting specific columns and removing missing values
new_transactions = transactions_query[['date', 'store_nbr', 'transactions']]
new_transactions = new_transactions.dropna()

# Define target variable and features
target = "transactions"
features = [i for i in new_transactions.columns if i != target]

train_original, test_original = train_test_split(new_transactions, test_size=0.2, random_state=42)
print(f"train_original: {len(train_original)} transactions, test_original: {len(test_original)} transactions")

# Convert each split's own dates to timestamps
train_original['date'] = pd.to_datetime(train_original['date'])
train_original['date'] = train_original['date'].apply(lambda x: x.timestamp())
test_original['date'] = pd.to_datetime(test_original['date'])
test_original['date'] = test_original['date'].apply(lambda x: x.timestamp())
```

We will use date and store_nbr as features because we want to know the total transactions of each store per date. We will use MAE and MAPE to measure accuracy. We expect these three models to produce relatively normal MAE and MAPE values. Also, we will use linear regression, a random forest regressor, and a decision tree regressor because our target is continuous.
##### Linear Regression

```{python}
lr = LinearRegression().fit(
    X=train_original[features],
    y=train_original[target])
train_original['lr'] = lr.predict(train_original[features])
evaluate(train_original[target], train_original['lr'])
test_original['lr'] = lr.predict(test_original[features])
evaluate(test_original[target], test_original['lr'])
```

##### Random Forest Regressor

```{python}
rf = RandomForestRegressor(max_depth=25).fit(
    X=train_original[features],
    y=train_original[target])
train_original['rf'] = rf.predict(train_original[features])
evaluate(train_original[target], train_original['rf'])
test_original['rf'] = rf.predict(test_original[features])
evaluate(test_original[target], test_original['rf'])
```

##### Decision Tree Regressor

```{python}
dr = DecisionTreeRegressor(max_depth=25).fit(
    X=train_original[features],
    y=train_original[target])
train_original['dr'] = dr.predict(train_original[features])
evaluate(train_original[target], train_original['dr'])
test_original['dr'] = dr.predict(test_original[features])
evaluate(test_original[target], test_original['dr'])
```

We merely adjusted the maximum depth of the random forest to achieve better MAE and MAPE values. Currently, the Random Forest Regressor performs the best because its test MAE is the lowest, 177.701769. However, with a training set MAPE of only 0.05 versus a testing set MAPE of 0.11, the gap indicates a potential risk of overfitting.
Next, we will proceed with cross-validation of these models.

##### Cross Validation

```{python}
# Cross-validation of the random forest model
# Reprocess the data to ensure that the date column is converted to a timestamp
new_transactions['date'] = pd.to_datetime(new_transactions['date'])
new_transactions['date'] = new_transactions['date'].apply(lambda x: x.timestamp())

# Update the feature list, ensuring that the target column is not included
features = [i for i in new_transactions.columns if i != target]

# Perform cross-validation
rf_scores = cross_val_score(RandomForestRegressor(max_depth=25),
                            X=new_transactions[features],
                            y=new_transactions[target],
                            cv=5, scoring='neg_mean_absolute_error')
rf_mae = -rf_scores.mean()
print(f"Random Forest MAE: {rf_mae}")
```

```{python}
# Cross-validation of the decision tree model
dr_scores = cross_val_score(DecisionTreeRegressor(max_depth=25),
                            X=new_transactions[features],
                            y=new_transactions[target],
                            cv=5, scoring='neg_mean_absolute_error')
dr_mae = -dr_scores.mean()
print(f"Decision Tree MAE: {dr_mae}")
```

##### Visualization of Model

```{python}
# Compare the two models
test_model = test_original.melt(
    id_vars=['transactions'] + features,
    value_vars=['dr', 'rf'],
    var_name='model', value_name='prediction')
test_model['model'] = test_model['model'].replace({
    'dr': 'Decision Regressor',
    'rf': 'Random Regressor',
})
test_model['resid'] = test_model['transactions'] - test_model['prediction']
test_model.head()
```

```{python}
show_plot(
    px.scatter(
        test_model, x="transactions", y="prediction", facet_col="model"
    )
    .update_traces(
        marker_size=3,
    ))
```

The outcomes of these two predictive models differed, with the random forest regression showing superior performance. On the training set: MAE 67.768521, MAPE 0.047034. On the test set: MAE 177.656220, MAPE 0.111345. After cross-validation, the Random Forest MAE was 276.70372007575213.

### Second attempt

The second time, we started to incorporate a time lag into the original transaction data provided by the competition.
We again chose three models: linear regression, decision tree, and random forest. The features consist of the newly added lagged transaction data. The target variable remains the prediction of transactions.

##### Lagged Features

```{python}
# Lagged feature, first try:
# train the model with data lagged/shifted by one day

# Tidy data
transactions['date'] = pd.to_datetime(transactions['date'])
transactions.sort_values(by='date', inplace=True)

# Create a lagged feature of 1 day
transactions['transactions_lag1'] = transactions['transactions'].shift(1)

# Remove rows containing null values
transactions.dropna(inplace=True)

# Define features and target variables
X_lag_transaction_day = transactions[['transactions_lag1']]  # feature variable
y_lag_transaction_day = transactions['transactions']  # target variable

# Split the dataset
X_train_lag_transaction_day, X_test_lag_transaction_day, y_train_lag_transaction_day, y_test_lag_transaction_day = train_test_split(
    X_lag_transaction_day, y_lag_transaction_day, test_size=0.2, random_state=42)
```

##### Linear Regression Model

```{python}
# Create model
try_model = LinearRegression()

# Train model
try_model.fit(X_train_lag_transaction_day, y_train_lag_transaction_day)

# Conduct predictions
predictions = try_model.predict(X_test_lag_transaction_day)

# Evaluation of the linear regression model on the test set
try_lr_eval = evaluate(y_test_lag_transaction_day, predictions)
mse = mean_squared_error(y_test_lag_transaction_day, predictions)
print(f'Linear Regression Test Evaluation \n{try_lr_eval}\n')

# Evaluation of the linear regression model on the training set
train_predictions_lag = try_model.predict(X_train_lag_transaction_day)
train_lr_eval_lag = evaluate(y_train_lag_transaction_day, train_predictions_lag)
train_mse_lag = mean_squared_error(y_train_lag_transaction_day, train_predictions_lag)
print(f'Linear Regression Training Evaluation (Lagged Features):\n{train_lr_eval_lag}\n')
```

##### Decision Tree Model

```{python}
# Create decision tree model
dt_model_lag = DecisionTreeRegressor(max_depth=5, min_samples_leaf=3)

# Train model
dt_model_lag.fit(X_train_lag_transaction_day, y_train_lag_transaction_day)

# Conduct predictions
dt_predictions_test = dt_model_lag.predict(X_test_lag_transaction_day)
dt_predictions_train = dt_model_lag.predict(X_train_lag_transaction_day)

# Evaluation of the decision tree model on the test set
dt_eval_test = evaluate(y_test_lag_transaction_day, dt_predictions_test)
print(f'Decision Tree Test Evaluation \n{dt_eval_test}\n')

# Evaluation of the decision tree model on the training set
dt_eval_train = evaluate(y_train_lag_transaction_day, dt_predictions_train)
print(f'Decision Tree Training Evaluation \n{dt_eval_train}\n')
```

##### Random Forest Model

```{python}
# Create a random forest model
rf_model_lag = RandomForestRegressor(n_estimators=100, max_depth=30)

# Train model
rf_model_lag.fit(X_train_lag_transaction_day, y_train_lag_transaction_day)

# Conduct predictions
rf_predictions_test = rf_model_lag.predict(X_test_lag_transaction_day)
rf_predictions_train = rf_model_lag.predict(X_train_lag_transaction_day)

# Evaluation of the random forest model on the test set
rf_eval_test = evaluate(y_test_lag_transaction_day, rf_predictions_test)
print(f'Random Forest Test Evaluation \n{rf_eval_test}\n')

# Evaluation of the random forest model on the training set
rf_eval_train = evaluate(y_train_lag_transaction_day, rf_predictions_train)
print(f'Random Forest Training Evaluation \n{rf_eval_train}\n')
```

We can see that the Mean Absolute Errors (MAE) of the three models are all approximately around 400-600, while the Mean Absolute Percentage Errors (MAPE) are between 0.4 and 0.6.
The results are not ideal, so we incorporate more complex and diverse data in the third attempt.

##### Visualization of Model

```{python}
# Merge predicted results
test_results = X_test_lag_transaction_day.copy()
test_results['Actual_Transactions'] = y_test_lag_transaction_day
test_results['LR_Predictions'] = try_model.predict(X_test_lag_transaction_day)
test_results['DT_Predictions'] = dt_model_lag.predict(X_test_lag_transaction_day)
test_results['RF_Predictions'] = rf_model_lag.predict(X_test_lag_transaction_day)
```

```{python}
show_plot(px.scatter(
    test_results,
    x='Actual_Transactions',
    y=['LR_Predictions', 'DT_Predictions', 'RF_Predictions'],
    labels={'value': 'Predicted Transactions'},
    title='Comparison of Actual vs. Predicted Transactions'))
```

```{python}
# Merge the prediction results of each model into a DataFrame
test_model = test_results.melt(id_vars=['Actual_Transactions'],
                               value_vars=['LR_Predictions', 'DT_Predictions', 'RF_Predictions'],
                               var_name='Model', value_name='Predictions')

# Map the model names to a more readable format
model_names = {
    'LR_Predictions': 'Linear Regression',
    'DT_Predictions': 'Decision Tree',
    'RF_Predictions': 'Random Forest'
}
test_model['Model'] = test_model['Model'].map(model_names)

# Create scatter plot
show_plot(px.scatter(
    test_model,
    x='Actual_Transactions',
    y='Predictions',
    facet_col='Model',
    color='Model',
    labels={'Predictions': 'Predicted Transactions'},
    title='Comparison of Actual vs. Predicted Transactions'
).update_traces(marker=dict(size=3)))
```

### Third attempt

We incorporated data on oil prices, merging the transactions from the various stores into the total national transaction volume per day. Lag features for the same month of each year and for each day have been added. We also cleaned up some rows with NaN values.
In contrast to the first attempt, we utilized data from 2013 to 2017, anticipating that such data would provide better training material.

##### Prepare Data

```{python}
# Adding lag data

# Remove all categorical data from sales, only leaving quantitative data
quantitative_sales = sales.filter(['date', 'transactions', 'dcoilwtico'])

# Ensure the date column is in datetime format
quantitative_sales['date'] = pd.to_datetime(quantitative_sales['date'])

# Sum transactions and average dcoilwtico to get the total transactions for each day nationwide
quantitative_sales = quantitative_sales.groupby('date').agg({
    'transactions': 'sum',  # Sum the transactions column
    'dcoilwtico': 'mean'    # Average the dcoilwtico column
}).reset_index()

quantitative_sales = quantitative_sales[
    (quantitative_sales['date'] >= '2013-01-02') &
    (quantitative_sales['date'] <= '2017-08-15')]

# Use linear interpolation to fill in NaN values in the oil price (dcoilwtico)
quantitative_sales['dcoilwtico'] = quantitative_sales['dcoilwtico'].interpolate(method='linear')

# Create year, month, and day-of-year columns
quantitative_sales['year'] = quantitative_sales['date'].dt.year
quantitative_sales['month'] = quantitative_sales['date'].dt.month
quantitative_sales['day_of_year'] = quantitative_sales['date'].dt.dayofyear

# Create a lag feature for the same month each year and use linear interpolation to handle NaN values
quantitative_sales['lag_same_month'] = quantitative_sales.groupby(['year', 'month'])['transactions'].shift(1)
quantitative_sales['lag_same_month'] = quantitative_sales['lag_same_month'].interpolate(method='linear')

# Apply a lag directly to the entire time series and use forward fill to handle NaN values
quantitative_sales['lag_transactions'] = quantitative_sales['transactions'].shift(1)
quantitative_sales['lag_transactions'] = quantitative_sales['lag_transactions'].fillna(method='ffill')

# Remove the row for 2013-01-01 because it contains NaN
quantitative_sales = quantitative_sales.dropna()

# Remove the year, month, and day_of_year columns to prevent them from impacting the model
quantitative_sales = quantitative_sales.drop(['year', 'month', 'day_of_year'], axis=1)

# Try removing the oil price (dcoilwtico) column to experiment with the model
# quantitative_sales = quantitative_sales.drop(['dcoilwtico'], axis=1)
```

During troubleshooting, we considered that the oil prices might be acting as noise for the model. We therefore attempted to remove the oil price data, but since this led to no progress, the removal was commented out.

##### Modeling

```{python}
# Adding holiday data

# Create a new column and initialize it to False
holidays_events['is_holiday'] = False

# Update the 'is_holiday' column
for index, row in holidays_events.iterrows():
    if row['type'] == 'Holiday' and not row['transferred']:
        holidays_events.at[index, 'is_holiday'] = True
    elif row['type'] in ['Transfer', 'Additional']:
        holidays_events.at[index, 'is_holiday'] = True

# Keep the required columns
holidays = holidays_events[['date', 'is_holiday']]

# Merge the holiday data with quantitative_sales
quantitative_sales['date'] = pd.to_datetime(quantitative_sales['date'])
holidays['date'] = pd.to_datetime(holidays['date'])
quantitative_sales = pd.merge(quantitative_sales, holidays, on='date', how='left')

# Fill in the missing 'is_holiday' values as False after merging
quantitative_sales['is_holiday'] = quantitative_sales['is_holiday'].fillna(False)
```

When training the model, not including the boolean values for holidays made the model more inaccurate on certain dates with extreme trading volumes.
Therefore, we organized the holiday data and extracted it as a feature variable for training the model.

```{python}
# Attempt to narrow down the data to see if it improves the model
# Select data between 2016-08-14 and 2017-08-14
# quantitative_sales = quantitative_sales[(quantitative_sales['date'] >= pd.Timestamp('2016-08-14')) & (quantitative_sales['date'] <= pd.Timestamp('2017-08-14'))]
```

Later in the modeling process, we experimented with a shorter time window to see if it would improve the model's performance. In practice there was little improvement, so the filter is commented out.

```{python}
# Transactions on Christmas and New Year's Day are extremely low, so boolean
# values are used to mark these days as special cases.
quantitative_sales['is_christmas_or_newyear'] = quantitative_sales['date'].apply(
    lambda x: (x.month == 12 and x.day == 25) or (x.month == 1 and x.day == 1)
)
```

We tagged these two extreme data points with boolean values.
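The same flag can be computed without `apply` by using the `.dt` accessor on the date column. A small sketch with made-up dates:

```{python}
import pandas as pd

dates = pd.Series(pd.to_datetime(['2016-12-25', '2016-12-26', '2017-01-01', '2017-06-15']))

# Vectorized equivalent of the apply/lambda version
is_christmas = (dates.dt.month == 12) & (dates.dt.day == 25)
is_newyear = (dates.dt.month == 1) & (dates.dt.day == 1)
flag = is_christmas | is_newyear

print(flag.tolist())  # [True, False, True, False]
```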
Since the transaction volumes on December 25th and January 1st are essentially zero every year, boolean values mark these two extreme cases in an attempt to improve model performance.

##### Preparing data for the models

```{python}
# Preparing data
X = quantitative_sales[['lag_same_month', 'lag_transactions', 'dcoilwtico',
                        'is_holiday', 'is_christmas_or_newyear']]  # features
y = quantitative_sales['transactions']  # target

# Splitting data chronologically (shuffle=False keeps the time order intact)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
```

##### Linear Regression Model

```{python}
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

lr_predictions = lr_model.predict(X_test)
lr_eval = evaluate(y_test, lr_predictions)
print(f'Linear Regression Test Evaluation:\n{lr_eval}\n')

lr_train_predictions = lr_model.predict(X_train)
lr_train_eval = evaluate(y_train, lr_train_predictions)
print(f'Linear Regression Training Evaluation:\n{lr_train_eval}\n')
```

##### Decision Tree Model

```{python}
dt_model = DecisionTreeRegressor(max_depth=5, min_samples_leaf=3)
dt_model.fit(X_train, y_train)

dt_predictions = dt_model.predict(X_test)
dt_eval = evaluate(y_test, dt_predictions)
print(f'Decision Tree Regressor Test Evaluation:\n{dt_eval}\n')

dt_train_predictions = dt_model.predict(X_train)
dt_train_eval = evaluate(y_train, dt_train_predictions)
print(f'Decision Tree Training Evaluation:\n{dt_train_eval}\n')
```

##### Random Forest Model

```{python}
rf_model = RandomForestRegressor(n_estimators=1, max_depth=15)
rf_model.fit(X_train, y_train)

rf_predictions = rf_model.predict(X_test)
rf_eval = evaluate(y_test, rf_predictions)
print(f'Random Forest Test Evaluation:\n{rf_eval}\n')

rf_train_predictions = rf_model.predict(X_train)
rf_train_eval = evaluate(y_train, rf_train_predictions)
print(f'Random Forest Training Evaluation:\n{rf_train_eval}\n')
```

On this large and complex dataset, the linear regression model's MAE and MAPE reach substantial values. The decision tree model has a test set MAPE of only 0.27, but its training set MAPE is extremely large, indicating insufficient performance. The random forest model, although only marginally usable, has a test set MAPE of roughly 0.25-0.31 and a training set MAPE of 0.05-0.06, a typical case of overfitting. Unfortunately, adjusting the maximum depth produced no clear pattern in these numbers, and a maximum depth of 15 appears to yield the best results at present.

##### Cross-validation

```{python}
# Cross-validation of the linear regression model (MAE)
lr_mae_scores = cross_val_score(LinearRegression(), X, y, cv=5,
                                scoring='neg_mean_absolute_error')
lr_mae_avg = -lr_mae_scores.mean()
print(f'Linear Regression - Avg MAE: {lr_mae_avg}')
```

```{python}
# Cross-validation of the decision tree model (MAE)
dt_mae_scores = cross_val_score(DecisionTreeRegressor(max_depth=5, min_samples_leaf=3),
                                X, y, cv=5, scoring='neg_mean_absolute_error')
dt_mae_avg = -dt_mae_scores.mean()
print(f'Decision Tree - Avg MAE: {dt_mae_avg}')
```

```{python}
# Cross-validation of the random forest model (MAE)
rf_mae_scores = cross_val_score(RandomForestRegressor(n_estimators=1, max_depth=30),
                                X, y, cv=5, scoring='neg_mean_absolute_error')
rf_mae_avg = -rf_mae_scores.mean()
print(f'Random Forest - Avg MAE: {rf_mae_avg}')
```

We attempted cross-validation but were unable to improve the models' MAE scores. So far, none of our adjustments has fundamentally changed the performance of the random forest model. We suspect that training on too much data increased the model's complexity, and this attempt should be counted among our limitations.

# Findings

In this project, our goal was to use time series analysis to forecast sales trends and predict transaction trends of stores in Ecuador.
The dataset we used was sourced from Kaggle, from a competition aimed at predicting sales in Ecuadorian stores, with data ranging from 2012 to 2017. In our first attempt, given the large scale of the dataset, we limited our data range to August 15, 2016 through August 15, 2017 to ensure the feasibility of our analysis and the accuracy of our model predictions. Sales figures were only present in the original training dataset; since we chose not to use that training data, we used transaction data to generate a training set for prediction. For model building, we opted for random forest regression and decision tree regression as our primary predictive tools. The two models produced similar results, with random forest regression performing slightly better.

In the second attempt, we began experimenting with time lags on the original transaction data provided by the competition, this time using the whole dataset from 2012 to 2017. We selected three models: linear regression, decision trees, and random forests, with the newly added lagged transaction data as a feature. The objective remained the prediction of transactions. However, the results for all three models were not ideal, with MAE around 400-600 and MAPE between 0.4 and 0.6.

To improve on this, we added more complex and diverse data in our third attempt: we included oil price data and combined transactions across stores to represent total national transactions per day. We added lag features for the same month each year and daily lags, while removing rows with NaN values. Differing from our first approach, we used all available data from 2013 to 2017, hoping this would provide better training material.
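As a toy illustration of the one-day lag feature described above (the numbers are made up):

```{python}
import pandas as pd

transactions = pd.Series([100, 120, 90, 130])

# shift(1) moves each value down one row, so every day "sees" the previous day's total
lag = transactions.shift(1)

print(lag.tolist())  # [nan, 100.0, 120.0, 90.0]
```

The first row has no predecessor, which is why the pipeline forward-fills or drops the resulting NaN before training.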
During model training, while trying to reduce error, we considered the possibility that oil prices were acting as noise in the model. We attempted to remove the oil price data, but this led to no significant progress, so we commented the removal out. We also noted that excluding holiday booleans made the model less accurate on dates with extreme transaction volumes, so we organized the holiday data and extracted it as a feature for model training. We then experimented with shorter time periods to see if they would enhance model performance; there was no significant improvement, so this approach was commented out as well. Finally, we marked two extreme data points with boolean values, since transaction volumes are essentially zero on December 25th and January 1st each year, in an attempt to improve model performance.

With these more complex datasets, the linear regression model's MAE and MAPE reached very high values. The decision tree had a test set MAPE of only 0.27 but an excessively large MAPE on the training set, indicating insufficient performance. The most usable model was the random forest, with a test set MAPE of about 0.25-0.31 and a training set MAPE of 0.05-0.06, a typical case of overfitting. Unfortunately, adjusting the maximum depth did not produce a consistent pattern of improvement, with a depth of 15 providing the best results so far. Attempts at cross-validation also failed to change the models' MAE scores significantly. Up to this point we have tried various adjustments but have been unable to fundamentally change the performance of the random forest model. We surmise that using an excessive amount of data for training increased complexity, and this should be considered a limitation of our approach.

# Limitations and Social / Ethical Considerations

In our project, we focused primarily on forecasting time series sales in the Ecuadorian retail market. Our analysis revealed several key limitations.
Firstly, since our dataset focused mainly on sales data from a specific region, it could have introduced a geographical bias, limiting the applicability of our analysis to other areas. Due to the complexity of the data, our method of imputing missing oil prices might not have fully accounted for the potential impact of oil price fluctuations on sales, which could affect the accuracy of our predictive models. Finally, we primarily used linear regression, decision trees, and random forest models, but these traditional statistical models may not be sufficient to capture the full complexity of the data, especially the relationships between sales and external factors such as holidays or economic indicators.

# Future Directions

During the course of the project, we identified several new research directions. For instance, exploring deeper relationships between sales data and factors like weather, holidays, and even socio-economic variables could be very valuable. This would require not only collecting a broader range of data, such as weather records and macroeconomic indicators, but possibly also applying more advanced analysis techniques, such as deep learning models, which can more effectively capture and predict complex time series patterns. Additionally, comparing sales patterns between different regions or types of stores is an intriguing direction, which would require integrating a more diverse set of datasets and possibly new data integration and cleaning techniques. Through these new research directions, we hope to gain a more comprehensive understanding of consumer behavior and improve the accuracy and reliability of sales forecasts.